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Abstract 

This paper deals with gene networks whose dynamics is assumed 
to be generated by a continuous-time, hnear, time invariant, finite 
dimensional system (LTI) at steady state. In particular, we deal with 
the problem of network reconstruction in the typical practical situation 
in which the number of available data is largely insufficient to uniquely 
determine the network. In order to try to remove this ambiguity, we 
will exploit the biologically a priori assumption of network sparseness, 
and propose a new algorithm for network reconstruction having a very 
low computational complexity (linear in the number of genes) so to be 
able to deal also with very large networks (say, thousands of genes). 
Its performances are also tested both on artificial data (generated with 
linear models) and on real data obtained by Gardner et al. from the 
SOS pathway in Escherichia coli. 

Keywords: Functional Genomics, Computational Biology, Gene Net- 
works, Reverse Engineering, Modelling. 



1 Introduction 



In this paper we will consider the most simple - not trivial dynami- 
cal model of a gene network, define a reverse engineering problem} and 
propose a fast algorithm to tackle this problem. We will deal then with 
continuous-time, linear, time invariant, finite dimensional system [LTI sys- 
tems) at steady state, in view of the relevant biological literature (see, for 
example Even though this oversimplified model may not be very realis- 
tic "far" from steady state, nevertheless it is a fundamental tool for studying 
and gaining insight into the basic mechanism that makes this problem an 
hard one, thus providing a valuable in numero testbed for gene networks re- 
construction algorithms. In fact, starting from the simplest and then moving 
toward the more and more complex is a typical scenario in the applied sci- 
ences. Consequently, in this paper we will consider only experiments regard- 
ing steady state measurements due to "small" input constant perturbations, 
so to retain linearity of the model. 

More precisely, we deal with the problem of reconstructing a gene network 
in the typical practical situation in which the number of available data is 
largely insufficient to uniquely determine the network. In order to try to 
remove this ambiguity, we will exploit some additional biologically relevant a 
priori assumptions such as sparseness, as it will be expalined in the following. 

The proposed algorithm has the major advantage to be very fast. In fact, 
its complexity is 0{2NM'^) and consequently it is particularly useful for large 
networks (large A^) and few data (small M). 

Finally we will also generate artificial data for the reconstruction algo- 
rithm. By doing so, one exactly knows the "true" gene regulation network 
and then algorithms performance can be evaluated. Moreover, the algorithm 
will be successfully applied to real data collected from experiments regarding 
a nine-gene subnetwork of the SOS pathway in Escherichia coli and very 
recently presented in jif. 

^In the system and control community, this problem is commonly defined as an identifi- 
cation problem. We prefer here to adopt the terminology used in the biological literature. 
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2 Dynamic model of expression data and prob- 
lem formulation 



The idea of modelling gene expression data with differential equations 
has been explored by a considerable numbers of authors (see for example 
references 0], [H, QjEl, 0, |1| and, for a literature review, the inter- 
ested reader may refer to jof). Differential equations are used to model gene 
interactions under the assumption that the rate of change over time of each 
gene expression level is a function of the expression level of some (usually a 
few) other genes. Such modeling assumption is based on the reaction kinetics 
at the biochemical level. A fully realistic model should consider a number of 
relevant biological issues such as, for example, internal and external noise, 
time delays, specific classes of nonlinearities, and should also consider the 
relationships between mRNA and protein concentrations since only the first 
one is actually measured by microarrays. Clearly, a lot of work remains to be 
done in the field of gene interaction modelling. Nevertheless, a very simple 
linear time invariant model has proved to be useful in a number of cases (as 
in [3] for rat cervical spinal cord data) even if it is clear that nonlinearity is an 
unavoidable issue since it reflects also the nature of biochemical interactions. 
However, as in we will assume the system to behave linearly around the 
steady states points reached in experiments. According to the work of Yeung 
et al. |l] we consider the LTI system described by the following differential 
equations: 

N 

Xi (t) = -\x, (t) + J2 Wij^j (t) + h {t) + 6 {t) (1) 

for i = 1,2, ...,iV, where the state variables the concentration of 

mRNA measured as a difference from the equilibrium state preceding the 
stimulus^, the Aj's are the self-degradation rates, the hiS are the external 
stimuli (depending on the specific experiment performed), and the ^j's rep- 
resent (internal) noise. The elements of the matrix W describe the type 
and strength of the "influence" of the j-th gene on the i-th gene with a 
positive, zero or negative sign indicating activation, no interaction and re- 
pression respectively. As described in jif, an experiment consists in applying 

^Clearly, these numbers can be positive, negative or zero provided that the measured 
concentration levels are greater, less or equal to the concentration present prior to the 
stimulus. 
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a prescribed {i.e. known) stimulus bi (t) which is a persistent perturbation 
(ideally a step function Si{t) of "amplitude" bi, i.e. bi (t) = bi6i(t)) to M in- 
put separately and then use a microarray to measure the response at steady 
state. Finally, the above situation, in view of equations and taking into 
account the measurement noise w introduced by the microarray device, can 
be formally described in the usual compact form as follows: 

x = Ax + B + ^ (2) 

y = X + w 

where the matrix Aisa. NxN matrix which incorporates both self-degradation 
rates (on its main diagonal entries) and the strength of the gene-to-gene in- 
teraction (on its off diagonal entries) and the columns of the N x M matrix 
B are the 6j's. We will assume standard normality properties on zero mean 
noises. 

Steady state (equilibria) solutions are given by Axe + Bue + ^ = = 
Aye + Bue + i — Aw, and if we repeat the M measures a sufficiently large 
number of times, we can calculate average values and then solve Ay + Bu = 0, 
where E[ye\ = y, E[ue] = u and - Aw] = so that E[w] = E[^] = 0. 

We can now state the problem considered in 0]: 



An Identification (Reverse Engineering) 
Problem for Gene Networks 

Given a network described by (0), initially at rest composed of N genes, 
where M independent constant inputs^ are applied and M noisy measure- 
ments y^ are taken at steady state. This process is repeated a "sufficiently" 
large number of times so that we can average the data and obtain the data 
matrix Y^xm, whose generic element is yj where superscripts denote ex- 
periments (which are then repeated M times), the overbar denotes averaging 
and the subscripts denote individual genes. 

Then, a reverse engineering problem consists in finding the "best" 
matrix A such that 

^NxnYnxM + BnxM = OatxM (3) 
■^The matrix B is known and full rank, possibly diagonal. 
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It is important to note that the matrix A in equation (jH)) is unique if 
and only if M > A^, i.e. provided that the number of experiments is equal 
or greater than the number of genes in the network. In what follows we 
will assume the typical situation in which M -C so that equation Q 
is underdetermined, that is, it has many solutions yielding AY^xm + B = 
0. Thus, in order to find the best choice for the matrix A, some a priori 
information has to be exploited incorporating some biological knowledge into 
the model at hand. One possibility, as discussed in a previous section, is to try 
to impose the additional biological constraint that usually gene networks are 
sparse, i.e. that generally each gene interacts with o nly a small percentage 
of all the genes in the entire genome (see for example |lO|). We will therefore 
exploit this feature and present our algorithm in the following section. 

In the next paragraph, we will first address the problem of generating 
artificial data consistent with the above biological assumption so to be able, 
after presenting a new reverse engineering algorithm, to evaluate its perfor- 
mances on such generated artificial data and, afterwards, also on real data 
taken from regarding the SOS pathway of Escherichia coli. 

3 The artificial data generation 

In this section we will briefly describe how one can set up an artificial 
problem and the data generated by this procedure. First of all we generated 
data assuming a linear model (as in (^) and the algorithms were tested on 
such data. However, in this preliminary work we just wanted to evaluate the 
performances of the algorithms presented in the next paragraphs. We would 
like also to stress that we applied our reconstruction algorithms to artificial 
data, before using real data. The reason simply being that, by doing so, one 
exactly knows the "true" regulation network, and the algorithm performances 
can be evaluated. 

Consequently, in our first artificial experiments, we generate the data 
with the linear model described in equation (^. Each repetition of the i- 
th experiment consists, in this model, in applying a constant unitary input 
(normalized) and then in picking a sample of the trajectory generated by 
equations (H)) at steady state. We generate then a sparse, asymptotically 
stable matrix A, and, as a second step, also the entries of input matrix B for 
each experiment. Finally, we obtain the artificial data by considering steady 
state values, and by repeating this process M times, we get the simulated 
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data matrix YArxM'*- 

4 A reconstruction algorithm for sparse gene 
networks 

As discussed in the previous sections, the reverse engineering problem 
for gene networks when the number of measurement is (much) less than the 
number of genes involved in the network, leads to many alternative networks 
among which we have to find a way to choosing the "best" one according to 
a prespecified cost function. 

As stated by the reverse engineering problem, once we have collected the 
mRNA abundance steady state measurements Ynxm from microarrays with 
M < A^, we may first find one of the many solutions A of AY + B = 0. 
A standard procedure for the case of interest with M < A^, is to perform 
a singular value decomposition on the data matrix YATxAf and obtain the 
matrix A with the smallest L2 norm. The procedure goes as follows. We 
decomposed the data matrix Y^xm as Y^xm = UnxnSnxmV^mxMi where 
UnxN and VF^xA/ are unitary matrices and the entries 0"^ of the matrix 
Snxm are such that aij = for all i ^ j and au > (T22 > • • • > (^mm > 0. 
The numbers an := cTj are the nonnegative square roots of the eigenvalues of 
^ATxM^xM known as the singular values of Yj^xm- As previously stated, the 
SVD provide a simple way to find a solution A to problem (jS)) with minimal 
L2 norm, i.e. with minimal ||v4||2, as^ Agvd = —BWdiagi=i^,,,^M{j-)U'^- All 
the solutions to Q are given by ^4 = Agvd + C, where C is any N x N matrix 
satisfying CnxmYnxm = 0, as one can easily verify by direct substitution. 
Equivalently, the matrix C is such that the columns of belong to the null 
space of Y^xM- 

A possible way of "reasonably" choosing the "best" matrix A without 
evaluating all the feasible solutions, is given next. Our aim is to take into 
account the sparseness assumption, that is try to incorporate information on 
the structure (zero pattern) of the matrix A. In particular, we will concen- 
trate on the case in which the number k of nonzero entries in each row of A 
is not greater than the number of available measurements, i.e. k < M. Con- 

^As previously stated, the expression level of the genes are calculated as the difference 
between the current expression level (the cellular concentration of the gene product). 
^Note that, in this case, we have As^^^tyxm + B = 0. 
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sequently, we have that k < M < N with k N. The proposed procedure, 
hereafter formaUy described, consists of three steps: 

The Fast Reverse Engineering Algorithm 

1. Assuming k < M < N and using the fuU rank data matrix Yjy^M, 
find the minimal L2 norm solution Agvd = —BWdiagi=i ... m{—)U'^ to 
equation AY]sixm + -B = using SVD of the data matrix Y|vxm- 

2. For each of the N rows of Asu^, determine the smallest in magnitude 
N — M + r) with r] >1 entries and consider them as " zero" . 

3. Taking into account the zero pattern detected at the previous step, find 
row by row the unique (sparse) matrix A solution of min arg 1 1 AYj^ -^m + B 
using the pseudo-inverse operator on the selected submatrix, and we 
are done. 

Here, the basic idea is to exploit the fact that, at the first step, the 
computed matrix Asvd is that of minimal L2 norm, so that it is reasonable 
to assume that the smallest magnitude entries correspond to zeros in the 
"true" matrix A. The main advantage of this approach is its simplicity and 
low computational complexity as it will discussed later in the paper. It's 
worth noting that Step 3 requires only that, for each of the rows of A, 
an (M — r]) X M matrix to be pseudo-inverted and this can be done in a 
very efficient computational way. The complexity of the algorithm is then 
0{2NM'^) so it is particularly useful for large networks (large N) and few 
data (small M). 

We compare here this algorithm with another one which makes use of 
an heuristics to find the zero pattern for each row of the matrix. Instead of 
applying Step 3 (of the Fast algorithm), for each row i, we put to zero one 
entry, say aij, of the matrix Agvd and denote this modified matrix by Asvd,ij- 
We evaluate then the error index Sij = \\Agy(i,ijY + B\\2. The position of the 
first zero in the i-th row is then that corresponding to the minimal value of 
6ij. We repeat this process until we have placed all the N — M + rj zeros. 
More precisely, using this criterion, we find for each row of the matrix Agvd 
the N — M + r} zero locations and consequently we can solve uniquely the 
system by selecting the M — rj nonzero elements for each row i. 

Simulation results are provided and discussed in the following section. 
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Figure 1: The percentage of errors estimated by the two algorithms vs. the 
number of available measurements (case of 50 genes). Data are generated with the 
linear model. 

5 Results on artificial data 

In this section, we will present the simulation results and the performances 
of the proposed reconstruction algorithm. First of all, we generate artificial 
data as described in the previous sections, then we estimate the matrix A, 
applying the reconstruction algorithms with r/ = 1. We measure the error by 

counting the percentage of sign discrepancies E = 100 — ^2 where 



In Figure we show the simulation results of a network composed by 50 
genes. We have generated the data using the linear model. In the figure it 
is clear how both the algorithms we propose produce good results. 

In Figure El we show the computing time^ of the two algorithms. It is 
clear that the proposed Fast algorithm grows much less than Greedy search. 
If the number of the genes of the analyzed network increases (say 200, 500, 
1000 or more genes) any exhaustive or greedy search, takes to much time to 
be computed. This is the strength of the algorithm we propose here. 

^The information on the sign of the entries is the most relevant information from a 
biological point of view since reflect the nature of the gene-to-gene interaction. 

^The algorithms described here have been fully implemented with Matlab. They ran 
on a Pentium IV with a clock speed of 1.8 GHz 
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if sign (ttij) = sign (uij) 
otherwise 
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Figure 2: The computing time of the two algorithms vs. the number of genes in 
the network with M = N/2. 

6 Results on real data 

As an example of application of the algorithm proposed in this paper, 
we consider the data obtained in for a nine-gene subnetwork of the SOS 
pathway in Escherichia coli. The collected data, i.e. the diagonal input 
matrix .Bgxg and the data matrix Ygxg, is reported in the web supplement of 
the paper^. 

It is apparent from the the table in Figure El that the estimated matrix 
A performs quite well since the numbers of correct gene interaction signs 
(as reported by the biological literature on the SOS pathway in E. coli) is 
not very different from that of Gardner et al. It is very important to note 
that the approach used by Gardner et al. considers, for each row, all the 
possible positions of the 4 zeros in the dynamic matrix A, thus leading to 
a combinatorial approach since it evaluates every feasible solution. This is 
viable only for small subnetworks, while our method is suitable also for very 
large number of genes {e.g. the Saccaromyces Cerevisiae has about 6000 
genes!) being polynomial in the number of genes. 

7 Conclusions 

In this paper we have addressed the so called "reverse engineering" prob- 
lem, that is the problem of reconstructing the gene-to-gene interactions from 
expression data (noisy samples of the state space trajectories) in the usual 

®http://www. sciencemag.org/cgi/contcnt/full/301/5629/102/DCl 
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Figure 3: The table (labelled with gene names) in the upper part of the picture is 
the identified matrix by Gardner et al., those in the lower part is obtained using 
our method. Numbers in bold indicate a correct sign. Note that a positive sign on 
the main diagonal denotes positive self-regulation apart from the self degradation 
rate, so that stability is guaranteed. The recF row must be disregarded since the 
experiments didn't yield reliable data. 

case of "few" samples and "many" genes. The typical situation is ~ 10 
samples for gene networks with ~ 5000 genes, as in |ll|. 

We have proposed a fast algorithm for finding a "reasonable" solutions 
among the many feasible by exploiting the a priori biological information 
about sparseness of the gene networks. However, this hypothesis in too 
generic for selecting a "small" set of solutions, but it is of great help. The 
main problem is that we only know that there are "many" zeros, but no as- 
sumptions are made about the locations of such zeros. This makes even the 
problem of imposing sparseness a difficult problem since it has an intrinsic 
combinatorial complexity. To avoid this, in the literature many approaches 
have been developed mainly using a greedy search strategy, but still, the 
computation burden is very high. In this paper we have proposed and evalu- 
ated on artificial and real data, an algorithm with a very low computational 
complexity (linear in the number of genes) which can be used also for genome 
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wide measurements. From this point of view, the algorithm presented seems 
very promising and future work will be devoted to explore its potential on 
large scale real data. 
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